$$
\frac{p(X)}{1-p(X)}=\frac{\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}}{\frac{1}{1+e^{\beta_0+\beta_1X}}}=e^{\beta_0+\beta_1X}
$$
\(p_k(x)\) is the posterior probability that an observation \(X=x\) belongs to the \(k\)th class: \[ p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)} \]
Then we take the log of both sides.
*Notice that the denominator is the same for every class, so we denote it by \(C\).
\[ \log p_k(x)=\log \pi_k+\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{x^2}{2\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\frac{x\mu_k}{\sigma^2}-\log C \] Here the second term, the third term, and \(\log C\) are the same for all classes, so maximizing \(\log p_k(x)\) is equivalent to maximizing \(\delta_k(x)\) in (4.13). Since log is a monotonically increasing function, maximizing \(p_k(x)\) in (4.12) is likewise equivalent to maximizing (4.13).
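Collecting only the terms that depend on \(k\) gives exactly the linear discriminant of (4.13):
\[ \delta_k(x)=\frac{x\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log\pi_k \]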
We can write the posterior probability of the \(k\)th class when each class has its own variance \(\sigma_k^2\): \[ p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma_k}\exp(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma_l}\exp(-\frac{1}{2\sigma_l^2}(x-\mu_l)^2)} \] Then we take the log of both sides as before; the denominator is still a constant \(C\). \[ \log p_k(x)=\log \pi_k+\log\frac{1}{\sqrt{2\pi}\sigma_k}-\frac{x^2}{2\sigma_k^2}-\frac{\mu_k^2}{2\sigma_k^2}+\frac{x\mu_k}{\sigma_k^2}-\log C \] Notice the third term: it is a quadratic function of \(x\) whose coefficient depends on \(k\), so the Bayes classifier is not linear.
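For reference, dropping \(\log C\) (which is the same for every class) gives the quadratic discriminant, making the dependence on \(x^2\) explicit:
\[ \delta_k(x)=-\frac{x^2}{2\sigma_k^2}+\frac{x\mu_k}{\sigma_k^2}-\frac{\mu_k^2}{2\sigma_k^2}+\log\frac{1}{\sqrt{2\pi}\sigma_k}+\log\pi_k \]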
On average, 1/10 of the available observations will be used to make the prediction, because the observations are uniformly distributed on [0, 1].
On average, when \(p=2\), \(\frac{1}{10}\times\frac{1}{10}=\frac{1}{100}\) of the observations will be used.
On average, when \(p=100\), only \((\frac{1}{10})^{100}\) of the observations will be used.
Suppose that on each dimension we uniformly divide the range [0, 1] into 10 bins of width 10%. If we want to find the nearest points within 10% of the range of X1, X2 through X100 (given p=100), then on average only a fraction \((\frac{1}{10})^{100}\) of the observations lies in such a neighborhood, so on the order of \(10^{100}\) observations would be needed to cover the range. Therefore, as p increases, the number of observations needed grows exponentially.
When there are a large number of dimensions, the fraction of observations that can be used for a KNN prediction becomes very small. This means that, for a fixed sample size, more features lead to fewer nearby neighbors.
*p=1: side length is \(0.1\).
*p=2: side length is \((0.1)^\frac{1}{2}\approx0.32\).
*p=n: side length is \((0.1)^\frac{1}{n}\), which approaches 1 as \(n\) grows.
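As a quick illustration (this arithmetic sketch is not part of the original answer), we can tabulate both quantities in R: the fraction of observations in a 10%-per-dimension neighborhood, and the hypercube side length needed to capture 10% of the observations.
p=c(1,2,5,10,100)
frac=0.1^p     # average fraction of observations within 10% on every dimension
side=0.1^(1/p) # side length of a hypercube containing 10% of the observations
data.frame(p,frac,side)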
QDA better on training set, LDA better on test set.
QDA better on both sets.
As the sample size increases, the test prediction accuracy of QDA improves relative to LDA.
Reason: QDA is more flexible than LDA, so as the sample size increases, its higher variance is less of a penalty and it can fit the data better.
False. QDA tends to overfit when the true decision boundary is linear, especially when the sample size is small.
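A small simulation sketch (not part of the original answer; the helper make_data, the data-generating process, and the seed are my own illustrative choices) comparing LDA and QDA test error when the true boundary is linear and the training set is small:
library(MASS)
set.seed(1)
make_data=function(n){
  x1=rnorm(n);x2=rnorm(n)
  y=factor(ifelse(x1+x2+rnorm(n)>0,'Yes','No')) # linear Bayes decision boundary
  data.frame(x1,x2,y)
}
tr_sim=make_data(50)   # small training set
te_sim=make_data(5000) # large test set
mean(predict(lda(y~x1+x2,data=tr_sim),te_sim)$class!=te_sim$y) # LDA test error
mean(predict(qda(y~x1+x2,data=tr_sim),te_sim)$class!=te_sim$y) # QDA test error
With a linear boundary and a small training set, QDA's extra flexibility typically shows up as a slightly higher test error than LDA.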
$$
\begin{aligned}
p(Y=A|X_1=40,X_2=3.5)&=\frac{e^{\beta_0+\beta_1X_1+\beta_2X_2}}{1+e^{\beta_0+\beta_1X_1+\beta_2X_2}}\\
&=\frac{e^{-6+0.05\times40+1\times3.5}}{1+e^{-6+0.05\times40+1\times3.5}}
\end{aligned}
$$
exp(-6+0.05*40+1*3.5)/(1+exp(-6+0.05*40+1*3.5))
## [1] 0.3775407
The probability that this student gets an A is about 0.38.
Solving the log-odds for \(X_1\): \[ X_1=\frac{\log\frac{p(x)}{1-p(x)}-\beta_0-\beta_2X_2}{\beta_1} \]
log(0.5/0.5*exp(6-3.5))/0.05
## [1] 50
50 hours are needed.
The response has two classes (Yes and No), so we can use Bayes' theorem: \[ Pr(Y=Yes|X=x)=\frac{\pi_1 f_1(x)}{\sum_{l=1}^2\pi_l f_l(x)} \] Here \(\pi_1=0.8\), \(\pi_2=0.2\), \(f_1(x)\) is the density of a normal random variable with mean 10 and variance 36, and \(f_2(x)\) is the density of a normal random variable with mean 0 and variance 36.
(0.8*dnorm(4,mean=10,sd=6))/(0.8*dnorm(4,mean=10,sd=6)+0.2*dnorm(4,mean=0,sd=6))
## [1] 0.7518525
Therefore, we can get: \[
\begin{aligned}
Pr(Y=Yes|X=4)&=\frac{\pi_1 f_1(4)}{\sum_{l=1}^2\pi_l f_l(4)}\\
&\approx0.75
\end{aligned}
\] The probability that a company will issue a dividend this year given \(X=4\) is about 0.75.
We want to solve for \(p(x)\): \[ \frac{p(x)}{1-p(x)}=0.37 \quad\Rightarrow\quad p(x)=\frac{0.37}{1+0.37}\approx0.27 \] This means that, on average, about 27% of people with odds of 0.37 will in fact default.
The odds are \(0.16/(1-0.16)\approx0.19\).
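A quick check of both conversions in R (not part of the original answer):
0.37/(1+0.37) # probability of default implied by odds of 0.37
0.16/(1-0.16) # odds of default implied by a default probability of 0.16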
library(ISLR)
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
pairs(~Year+Lag1+Lag2+Lag3+Lag4+Lag5+Volume+Today,Weekly)
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Weekly,family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 appears to be the only statistically significant predictor, with a p-value of about 0.03.
glm.prob=predict(glm.fit,type='response')
glm.pred=rep('Down',1089)
glm.pred[glm.prob>0.5]='Up'
table(glm.pred,Weekly[,"Direction"])
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
The overall fraction of correct predictions is \(\frac{54+557}{54+48+430+557} \approx0.56\).
When the model predicts Down, it is wrong \(\frac{48}{48+54} \approx0.47\) of the time.
When the model predicts Up, it is wrong \(\frac{430}{430+557} \approx0.44\) of the time. In other words, the model predicts Up for most weeks: it catches most true Up weeks but misclassifies the large majority of true Down weeks.
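These rates can also be read off the confusion matrix directly; a short sketch using the objects created above:
t=table(glm.pred,Weekly$Direction)
sum(diag(t))/sum(t)            # overall fraction of correct predictions
t['Down','Up']/sum(t['Down',]) # error rate among weeks predicted Down
t['Up','Down']/sum(t['Up',])   # error rate among weeks predicted Up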
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
train=filter(Weekly,Year>=1990&Year<=2008)
test=filter(Weekly,Year>2008)
glm.fit1=glm(Direction~Lag2,data=train,family=binomial)
glm.prob1=predict(glm.fit1,test,type='response')
glm.pred1=rep('Down',104)
glm.pred1[glm.prob1>0.5]='Up'
table(glm.pred1,test[,"Direction"])
##
## glm.pred1 Down Up
## Down 9 5
## Up 34 56
mean(glm.pred1==test[,"Direction"])
## [1] 0.625
The overall fraction of correct predictions on the held-out data is \(0.625\).
library(dplyr)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
lda.fit=lda(Direction~Lag2,train)
lda.pred=predict(lda.fit,test)
lda.class=lda.pred$class
table(lda.class,test[,"Direction"])
##
## lda.class Down Up
## Down 9 5
## Up 34 56
mean(lda.class==test[,"Direction"])
## [1] 0.625
The overall fraction of correct predictions on the held-out data is \(65/104\approx0.625\), identical to the logistic regression result.
qda.fit=qda(Direction~Lag2,train)
qda.pred=predict(qda.fit,test)
qda.class=qda.pred$class
table(qda.class,test[,"Direction"])
##
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class==test[,"Direction"])
## [1] 0.5865385
The overall fraction of correct predictions on the held-out data is \(61/104\approx0.59\); note that QDA achieves this by predicting Up for every week.
library(class)
train.x=as.matrix(train$Lag2)
test.x=as.matrix(test$Lag2)
train.y=train$Direction
set.seed(1)
knn.pred=knn(train.x,test.x,train.y,k=1)
table(knn.pred,test$Direction)
##
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred==test[,"Direction"])
## [1] 0.5
The overall fraction of correct predictions on the held-out data is \(0.5\).
Logistic regression and LDA provide the best results.
Here we experiment with different values of \(K\) for KNN.
k=c(1,3,5,10,15,20,30) # candidate values of K
rate=rep(0,7)           # test accuracy for each K
a=1
for(i in k){
  knn.pred=knn(train.x,test.x,train.y,k=i)
  rate[a]=mean(knn.pred==test[,"Direction"])
  a=a+1
}
rate
## [1] 0.5000000 0.5576923 0.5384615 0.5769231 0.5865385 0.5769231 0.5576923
The test accuracy peaks at \(K=15\) (about 0.59); values of \(K\) between roughly 10 and 30 give the best results here.
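To see this at a glance, we can plot accuracy against \(K\) (a quick addition, using the k and rate vectors created above):
plot(k,rate,type='b',xlab='K',ylab='Test accuracy')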
med=median(Auto$mpg)
mpg01=rep(0,dim(Auto)[1])
mpg01[Auto$mpg>=med]=1
new_Auto=data.frame(Auto,mpg01)
pairs(new_Auto)
mpg, weight, horsepower, acceleration, and displacement appear most useful for predicting mpg01. (Note that mpg01 is defined from mpg, so mpg separates the two classes by construction.)
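As a quick numerical check (a sketch, not in the original; num_Auto is just new_Auto with the non-numeric name column dropped), we can look at the correlation of each variable with mpg01:
num_Auto=new_Auto[,sapply(new_Auto,is.numeric)] # keep numeric columns only
round(cor(num_Auto)[,'mpg01'],2)                # correlation of each variable with mpg01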
set.seed(1)
tr <- sample(1:nrow(new_Auto), nrow(new_Auto)*0.7 , replace=F) # 70% train, 30% test
train <- new_Auto[tr,]
test <- new_Auto[-tr,]
lda.fit=lda(mpg01~mpg+weight+horsepower+acceleration+displacement,train)
lda.pred=predict(lda.fit,test)
lda.class=lda.pred$class
table(lda.class,test$mpg01)
##
## lda.class 0 1
## 0 55 1
## 1 6 56
The test error rate is \((1+6)/118\approx0.06\).
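Equivalently, in R (one extra line, not in the original):
mean(lda.class!=test$mpg01) # LDA test error rate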
qda.fit=qda(mpg01~mpg+weight+horsepower+acceleration+displacement,train)
qda.pred=predict(qda.fit,test)
qda.class=qda.pred$class
table(qda.class,test$mpg01)
##
## qda.class 0 1
## 0 55 2
## 1 6 55
mean(qda.class==test$mpg01)
## [1] 0.9322034
The test error rate is about \(0.07\) (\(1-0.9322\)).
glm.fit=glm(mpg01~mpg+weight+horsepower+acceleration+displacement,train,family=binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.prob=predict(glm.fit,test,type='response')
glm.pred=rep(0,nrow(test))
glm.pred[glm.prob>0.5]=1
table(glm.pred,test$mpg01)
##
## glm.pred 0 1
## 0 60 0
## 1 0 57
The test error rate is about 0 for this split. (The warnings above arise because mpg, from which mpg01 was constructed, separates the two classes perfectly.)
library(class)
train.x=dplyr::select(train,mpg,weight,horsepower,acceleration,displacement)
train.y=dplyr::select(train,mpg01)[,1]
test.x=dplyr::select(test,mpg,weight,horsepower,acceleration,displacement)
test.y=dplyr::select(test,mpg01)[,1]
min_error_rate=1000
k=0
for (i in 1:200){
  knn.pred=knn(train.x,test.x,train.y,i)
  t=table(knn.pred,test$mpg01)
  error_rate=(t[2]+t[3])/sum(t) # misclassified test observations over the total
  if(error_rate<min_error_rate){
    min_error_rate=error_rate
    k=i
  }
}
c(min_error_rate,k)
## [1] 0.1016949 3.0000000
Note: when several attached packages define functions with the same name (here both MASS and dplyr define select()), the double-colon operator dplyr::select() specifies exactly which package's function is meant.
According to this search, KNN performs best at \(K=3\).
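To look at the whole error curve rather than just the minimum, one option (a sketch, not in the original) is to record the error rate for every \(K\) and plot it:
err=sapply(1:200,function(i) mean(knn(train.x,test.x,train.y,k=i)!=test.y))
plot(1:200,err,type='l',xlab='K',ylab='Test error rate')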
Power<-function(){
  print(2^3)
}
Power2<-function(x,a){
  print(x^a)
}
Power2(3,8)
## [1] 6561
a<-data.frame(c(3,8),c(8,17),c(131,3)) # the (x,a) pairs to evaluate: 3^8, 8^17, 131^3
for(i in 1:3){
  Power2(a[1,i],a[2,i])
}
## [1] 6561
## [1] 2.2518e+15
## [1] 2248091
Power3<-function(x,a){
  result=x^a
  return(result)
}
x=1:10
y=rep(0,10)
for(i in 1:10){
  y[i]=Power3(i,2)
}
plot(x,y,log='x')
plot(x,y,log='y')
plot(x,y,log='xy')
PlotPower<-function(x,a){
  y=rep(0,length(x))
  for(i in seq_along(x)){ # loop over every element of x, not just the first 10
    y[i]=Power3(x[i],a)
  }
  plot(x,y)
}
PlotPower(1:10,3)